Blame: archived/ingest_text_data.ipynb - aws/amazon-sagemaker-examples

Example 📓 Jupyter notebooks that demonstrate how to build, train, and deploy machine learning models using 🧠 Amazon SageMaker.

New File Structure Implementation (#4716), 2024-07-26 14:42:50 -07:00
			`{`
			`"cells": [`
			`{`
			`"cell_type": "markdown",`
			`"metadata": {},`
			`"source": [`
			`"# Ingest Text Data\n"`
			`]`
			`},`
			`{`
			`"attachments": {},`
			`"cell_type": "markdown",`
			`"metadata": {},`
			`"source": [`
			`"---\n",`
			`"\n",`
			`"This notebook's CI test result for us-west-2 is as follows. CI test results in other regions can be found at the end of the notebook. \n",`
			`"\n",`
			`"![This us-west-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/us-west-2/ingest_data\|ingest-data-types\|ingest_text_data.ipynb)\n",`
			`"\n",`
			`"---"`
			`]`
			`},`
			`{`
			`"cell_type": "markdown",`
			`"metadata": {},`
			`"source": [`
			"Labeled text data can come in a structured format, such as reviews for sentiment analysis, news headlines for topic modeling, or documents for text classification. In these cases, you may have one column for the label, one column for the text, and sometimes other columns for attributes. You can treat this structured data like tabular data. Text data, especially raw text, often arrives unstructured, typically in .json or .txt format. This section discusses how to ingest these types of files into a SageMaker notebook.\n"
			`]`
			`},`
			`{`
			`"cell_type": "markdown",`
			`"metadata": {},`
			`"source": [`
			`"## Set Up Notebook"`
			`]`
			`},`
			`{`
			`"cell_type": "code",`
			`"execution_count": null,`
			`"metadata": {},`
			`"outputs": [],`
			`"source": [`
			`"%pip install -q 's3fs==0.4.2'"`
			`]`
			`},`
			`{`
			`"cell_type": "code",`
			`"execution_count": null,`
			`"metadata": {},`
			`"outputs": [],`
			`"source": [`
			`"import pandas as pd\n",`
			`"import json\n",`
			`"import glob\n",`
			`"import s3fs\n",`
			`"import sagemaker"`
			`]`
			`},`
			`{`
			`"cell_type": "code",`
			`"execution_count": null,`
			`"metadata": {},`
			`"outputs": [],`
			`"source": [`
			`"# Get SageMaker session & default S3 bucket\n",`
			`"sagemaker_session = sagemaker.Session()\n",`
			`"bucket = sagemaker_session.default_bucket() # replace with your own bucket if you have one\n",`
			`"s3 = sagemaker_session.boto_session.resource(\"s3\")\n",`
			`"\n",`
			`"prefix = \"text_spam/spam\"\n",`
			`"prefix_json = \"json_jeo\"\n",`
			`"filename = \"SMSSpamCollection.txt\"\n",`
			`"filename_json = \"JEOPARDY_QUESTIONS1.json\""`
			`]`
			`},`
			`{`
			`"cell_type": "markdown",`
			`"metadata": {},`
			`"source": [`
			`"## Downloading data from Online Sources\n",`
			`"\n",`
			`"### Text data (in structured .csv format): Twitter -- sentiment140\n",`
			`"This is the Sentiment140 dataset. It contains 1.6M tweets extracted using the Twitter API. The tweets have been annotated with sentiment (0 = negative, 4 = positive) and topics (hashtags used to retrieve tweets). The dataset contains the following columns:\n",`
			"* `target`: the polarity of the tweet (0 = negative, 4 = positive)\n",
			"* `ids`: the id of the tweet (2087)\n",
			"* `date`: the date of the tweet (Sat May 16 23:58:44 UTC 2009)\n",
			"* `flag`: the query (lyx). If there is no query, then this value is NO_QUERY.\n",
			"* `user`: the user that tweeted (robotickilldozr)\n",
			"* `text`: the text of the tweet (Lyx is cool)\n",
			`"\n",`
			`"[Second Twitter data](https://github.com/guyz/twitter-sentiment-dataset) is a Twitter data set collected as an extension to Sanders Analytics Twitter sentiment corpus, originally designed for training and testing Twitter sentiment analysis algorithms. We will use this data to showcase how to aggregate two data sets if you want to enhance your current data set by adding more data to it."`
			`]`
			`},`
			`{`
			`"cell_type": "code",`
			`"execution_count": null,`
			`"metadata": {},`
			`"outputs": [],`
			`"source": [`
			`"# helper functions to upload data to s3\n",`
			`"def write_to_s3(filename, bucket, prefix):\n",`
			`" # put one file in a separate folder. This is helpful if you read and prepare data with Athena\n",`
			`" key = \"{}/{}\".format(prefix, filename)\n",`
			`" return s3.Bucket(bucket).upload_file(filename, key)\n",`
			`"\n",`
			`"\n",`
			`"def upload_to_s3(bucket, prefix, filename):\n",`
			`" url = \"s3://{}/{}/{}\".format(bucket, prefix, filename)\n",`
			`" print(\"Writing to {}\".format(url))\n",`
			`" write_to_s3(filename, bucket, prefix)"`
			`]`
			`},`
			`{`
			`"cell_type": "code",`
			`"execution_count": null,`
			`"metadata": {},`
			`"outputs": [],`
			`"source": [`
			`"# run this cell if you are in SageMaker Studio notebook\n",`
			`"#!apt-get install unzip"`
			`]`
			`},`
			`{`
			`"cell_type": "code",`
			`"execution_count": null,`
			`"metadata": {},`
			`"outputs": [],`
			`"source": [`
			`"# download first twitter dataset\n",`
			`"!wget http://cs.stanford.edu/people/alecmgo/trainingandtestdata.zip -O sentiment140.zip\n",`
			`"# uncompress into the sentiment140/ folder\n",`
			`"!unzip -o sentiment140.zip -d sentiment140"`
			`]`
			`},`
			`{`
			`"cell_type": "code",`
			`"execution_count": null,`
			`"metadata": {},`
			`"outputs": [],`
			`"source": [`
			`"# upload the files to the S3 bucket\n",`
			`"csv_files = glob.glob(\"sentiment140/*.csv\")\n",`
			`"for filename in csv_files:\n",`
			`" upload_to_s3(bucket, \"text_sentiment140\", filename)"`
			`]`
			`},`
			`{`
			`"cell_type": "code",`
			`"execution_count": null,`
			`"metadata": {},`
			`"outputs": [],`
			`"source": [`
			`"# download second twitter dataset\n",`
			`"!wget https://raw.githubusercontent.com/zfz/twitter_corpus/master/full-corpus.csv"`
			`]`
			`},`
			`{`
			`"cell_type": "code",`
			`"execution_count": null,`
			`"metadata": {},`
			`"outputs": [],`
			`"source": [`
			`"filename = \"full-corpus.csv\"\n",`
			`"upload_to_s3(bucket, \"text_twitter_sentiment_2\", filename)"`
			`]`
			`},`
			`{`
			`"cell_type": "markdown",`
			`"metadata": {},`
			`"source": [`
			`"### Text data (in .txt format): SMS Spam data \n",`
			`"[SMS Spam Data](https://archive.ics.uci.edu/ml/datasets/sms+spam+collection) was manually extracted from the Grumbletext Web site. This is a UK forum in which cell phone users make public claims about SMS spam messages, most of them without reporting the very spam message received. Each line in the text file has the correct class followed by the raw message. We will use this data to showcase how to ingest text data in .txt format."`
			`]`
			`},`
			`{`
			`"cell_type": "code",`
			`"execution_count": null,`
			`"metadata": {},`
			`"outputs": [],`
			`"source": [`
			`"!wget http://www.dt.fee.unicamp.br/~tiago/smsspamcollection/smsspamcollection.zip -O spam.zip\n",`
			`"!unzip -o spam.zip -d spam"`
			`]`
			`},`
			`{`
			`"cell_type": "code",`
			`"execution_count": null,`
			`"metadata": {},`
			`"outputs": [],`
			`"source": [`
			`"txt_files = glob.glob(\"spam/*.txt\")\n",`
			`"for filename in txt_files:\n",`
			`" upload_to_s3(bucket, \"text_spam\", filename)"`
			`]`
			`},`
			`{`
			`"cell_type": "markdown",`
			`"metadata": {},`
			`"source": [`
			`"### Text Data (in .json format): Jeopardy Question data\n",`
			`"[Jeopardy Question](https://j-archive.com/) was obtained by crawling the Jeopardy question archive website. It is an unordered list of questions where each question has the following key-value pairs:\n",`
			`"\n",`
			"* `category`: the question category, e.g. \"HISTORY\"\n",
			"* `value`: dollar value of the question as string, e.g. \"\\$200\"\n",
			"* `question`: text of question\n",
			"* `answer`: text of answer\n",
			"* `round`: one of \"Jeopardy!\", \"Double Jeopardy!\", \"Final Jeopardy!\" or \"Tiebreaker\"\n",
			"* `show_number`: string of show number, e.g. '4680'\n",
			"* `air_date` : the show air date in format YYYY-MM-DD"
			`]`
			`},`
			`{`
			`"cell_type": "code",`
			`"execution_count": null,`
			`"metadata": {},`
			`"outputs": [],`
			`"source": [`
			`"# download the Jeopardy questions (a single, uncompressed .json file)\n",`
			`"! wget 'https://docs.google.com/uc?export=download&id=0BwT5wj_P7BKXb2hfM3d2RHU1ckE' -O JEOPARDY_QUESTIONS1.json\n",`
			`"# upload the file to the S3 bucket\n",`
			`"filename = \"JEOPARDY_QUESTIONS1.json\"\n",`
			`"upload_to_s3(bucket, \"json_jeo\", filename)"`
			`]`
			`},`
			`{`
			`"cell_type": "markdown",`
			`"metadata": {},`
			`"source": [`
			`"## Ingest Data into SageMaker Notebook\n",`
			`"## Method 1: Copying data to the Instance\n",`
			`"You can use the AWS Command Line Interface (CLI) to copy your data from S3 to your SageMaker instance. This is a quick and easy approach when you are dealing with medium-sized data files, or when you are experimenting and doing exploratory analysis. The documentation can be found [here](https://docs.aws.amazon.com/cli/latest/reference/s3/cp.html)."`
			`]`
			`},`
			`{`
			`"cell_type": "code",`
			`"execution_count": null,`
			`"metadata": {},`
			`"outputs": [],`
			`"source": [`
			`"# Specify file names\n",`
			`"prefix = \"text_spam/spam\"\n",`
			`"prefix_json = \"json_jeo\"\n",`
			`"filename = \"SMSSpamCollection.txt\"\n",`
			`"filename_json = \"JEOPARDY_QUESTIONS1.json\"\n",`
			`"prefix_spam_2 = \"text_spam/spam_2\""`
			`]`
			`},`
			`{`
			`"cell_type": "code",`
			`"execution_count": null,`
			`"metadata": {},`
			`"outputs": [],`
			`"source": [`
			`"# copy data to your sagemaker instance using AWS CLI\n",`
			`"!aws s3 cp s3://$bucket/$prefix_json/ text/$prefix_json/ --recursive"`
			`]`
			`},`
			`{`
			`"cell_type": "code",`
			`"execution_count": null,`
			`"metadata": {},`
			`"outputs": [],`
			`"source": [`
			`"data_location = \"text/{}/{}\".format(prefix_json, filename_json)\n",`
			`"with open(data_location) as f:\n",`
			`" data = json.load(f)\n",`
			`" print(data[0])"`
			`]`
			`},`
			`{`
			`"cell_type": "markdown",`
			`"metadata": {},`
			`"source": [`
			`"## Method 2: Use AWS compatible Python Packages\n",`
			"When you are dealing with large data sets, or do not want to lose any data when you delete your SageMaker Notebook Instance, you can use pre-built packages to access your files in S3 without copying them onto your instance. Packages such as `pandas` accept a path string that selects the storage backend: you use `file://` for your local file system and `s3://` for data in S3, which `pandas` reads through `s3fs` (itself built on botocore). For `pandas`, any valid string path is acceptable. The string could be a URL. Valid URL schemes include http, ftp, s3, and file. For file URLs, a host is expected. You can find additional documentation [here](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html). \n",
			`"\n",`
			`"For text data, you can usually either read the file line by line or use pandas to read it as a DataFrame by specifying a delimiter."`
			`]`
			`},`
			`{`
			`"cell_type": "code",`
			`"execution_count": null,`
			`"metadata": {},`
			`"outputs": [],`
			`"source": [`
			`"data_s3_location = \"s3://{}/{}/{}\".format(bucket, prefix, filename) # S3 URL\n",`
			`"s3_tabular_data = pd.read_csv(data_s3_location, sep=\"\\t\", header=None)\n",`
			`"s3_tabular_data.head()"`
			`]`
			`},`
			`{`
			`"cell_type": "markdown",`
			`"metadata": {},`
			`"source": [`
			"For JSON files, you can also use the `pandas` `read_json` function, provided the file has a flat structure (for example, a list of records with no nested objects)."
			`]`
			`},`
			`{`
			`"cell_type": "code",`
			`"execution_count": null,`
			`"metadata": {},`
			`"outputs": [],`
			`"source": [`
			`"data_json_location = \"s3://{}/{}/{}\".format(bucket, prefix_json, filename_json)\n",`
			`"s3_tabular_data_json = pd.read_json(data_json_location, orient=\"records\")\n",`
			`"s3_tabular_data_json.head()"`
			`]`
			`},`
			`{`
			`"cell_type": "markdown",`
			`"metadata": {},`
			`"source": [`
			`"## Method 3: Use AWS Native methods\n",`
			`"#### s3fs\n",`
			`"[S3Fs](https://s3fs.readthedocs.io/en/latest/) is a Pythonic file interface to S3. It builds on top of botocore. The top-level class S3FileSystem holds connection information and allows typical file-system style operations like cp, mv, ls, du, glob, etc., as well as put/get of local files to/from S3. "`
			`]`
			`},`
			`{`
			`"cell_type": "code",`
			`"execution_count": null,`
			`"metadata": {},`
			`"outputs": [],`
			`"source": [`
			`"fs = s3fs.S3FileSystem()\n",`
			`"data_s3fs_location = \"s3://{}/{}/\".format(bucket, prefix)\n",`
			`"# list all files under this prefix in your bucket\n",`
			`"fs.ls(data_s3fs_location)"`
			`]`
			`},`
			`{`
			`"cell_type": "code",`
			`"execution_count": null,`
			`"metadata": {},`
			`"outputs": [],`
			`"source": [`
			`"# open it directly with s3fs\n",`
			`"data_s3fs_location = \"s3://{}/{}/{}\".format(bucket, prefix, filename) # S3 URL\n",`
			`"with fs.open(data_s3fs_location) as f:\n",`
			`" print(pd.read_csv(f, sep=\"\\t\", nrows=2))"`
			`]`
			`},`
			`{`
			`"cell_type": "markdown",`
			`"metadata": {},`
			`"source": [`
			`"## Aggregating datasets\n",`
			`"If you would like to enhance your data with more data collected for your use cases, you can always aggregate your newly-collected data with your current dataset. We will use two datasets -- Sentiment140 and Sanders Twitter Sentiment to show how to aggregate data together."`
			`]`
			`},`
			`{`
			`"cell_type": "code",`
			`"execution_count": null,`
			`"metadata": {},`
			`"outputs": [],`
			`"source": [`
			`"prefix_tw1 = \"text_sentiment140/sentiment140\"\n",`
			`"filename_tw1 = \"training.1600000.processed.noemoticon.csv\"\n",`
			`"prefix_added = \"text_twitter_sentiment_2\"\n",`
			`"filename_added = \"full-corpus.csv\""`
			`]`
			`},`
			`{`
			`"cell_type": "markdown",`
			`"metadata": {},`
			`"source": [`
			`"Let's read in our original data and take a look at its format and schema:"`
			`]`
			`},`
			`{`
			`"cell_type": "code",`
			`"execution_count": null,`
			`"metadata": {},`
			`"outputs": [],`
			`"source": [`
			`"data_s3_location_base = \"s3://{}/{}/{}\".format(bucket, prefix_tw1, filename_tw1) # S3 URL\n",`
			`"# we will showcase with a smaller subset of data for demonstration purpose\n",`
			`"text_data = pd.read_csv(\n",`
			`" data_s3_location_base, header=None, encoding=\"ISO-8859-1\", low_memory=False, nrows=10000\n",`
			`")\n",`
			`"text_data.columns = [\"target\", \"tw_id\", \"date\", \"flag\", \"user\", \"text\"]"`
			`]`
			`},`
			`{`
			`"cell_type": "markdown",`
			`"metadata": {},`
			`"source": [`
			"We have 6 columns: `date`, `text`, `flag` (the query topic used to retrieve the tweet), `tw_id` (the tweet's id), `user` (user account name), and `target` (0 = neg, 4 = pos)."
			`]`
			`},`
			`{`
			`"cell_type": "code",`
			`"execution_count": null,`
			`"metadata": {},`
			`"outputs": [],`
			`"source": [`
			`"text_data.head(1)"`
			`]`
			`},`
			`{`
			`"cell_type": "markdown",`
			`"metadata": {},`
			`"source": [`
			`"Let's read in and take a look at the data we want to add to our original data. \n",`
			`"\n",`
			"We will start by comparing the columns of both data sets. The new data set has 5 columns: `TweetDate`, which maps to `date`; `TweetText`, which maps to `text`; `Topic`, which maps to `flag`; `TweetId`, which maps to `tw_id`; and `Sentiment`, which maps to `target`. The new data set has no user account name column, so when we aggregate the two data sets we can add this column to the data set being added and fill it with `NULL` values. You can also remove this column from the original data if it does not provide much valuable information for your use case. "
			`]`
			`},`
			`{`
			`"cell_type": "code",`
			`"execution_count": null,`
			`"metadata": {},`
			`"outputs": [],`
			`"source": [`
			`"data_s3_location_added = \"s3://{}/{}/{}\".format(bucket, prefix_added, filename_added) # S3 URL\n",`
			`"# we will showcase with a smaller subset of data for demonstration purpose\n",`
			`"text_data_added = pd.read_csv(\n",`
			`" data_s3_location_added, encoding=\"ISO-8859-1\", low_memory=False, nrows=10000\n",`
			`")"`
			`]`
			`},`
			`{`
			`"cell_type": "code",`
			`"execution_count": null,`
			`"metadata": {},`
			`"outputs": [],`
			`"source": [`
			`"text_data_added.head(1)"`
			`]`
			`},`
			`{`
			`"cell_type": "markdown",`
			`"metadata": {},`
			`"source": [`
			"#### Add the missing column to the new data set and fill it with `NULL`"
			`]`
			`},`
			`{`
			`"cell_type": "code",`
			`"execution_count": null,`
			`"metadata": {},`
			`"outputs": [],`
			`"source": [`
			`"text_data_added[\"user\"] = None # no user names in this data set, so fill with NULL (missing) values"`
			`]`
			`},`
			`{`
			`"cell_type": "markdown",`
			`"metadata": {},`
			`"source": [`
			`"#### Renaming the new data set columns to combine two data sets"`
			`]`
			`},`
			`{`
			`"cell_type": "code",`
			`"execution_count": null,`
			`"metadata": {},`
			`"outputs": [],`
			`"source": [`
			`"text_data_added.columns = [\"flag\", \"target\", \"tw_id\", \"date\", \"text\", \"user\"]\n",`
			`"text_data_added.head(1)"`
			`]`
			`},`
			`{`
			`"cell_type": "markdown",`
			`"metadata": {},`
			`"source": [`
			"#### Change the `target` column to the same format as the `target` in the original data set\n",
			"Note that the `target` column in the new data set is marked as \"positive\", \"negative\", \"neutral\", and \"irrelevant\", whereas the `target` in the original data set is marked as \"0\" and \"4\". So let's map \"positive\" to 4, \"neutral\" to 2, and \"negative\" to 0 in our new data set so that they are consistent. Tweets marked \"irrelevant\" are either not in English or spam. You can either remove them if they are not valuable for your use case or map them to -1. In our sentiment analysis use case, we will remove them, since these texts provide no value for predicting sentiment. "
			`]`
			`},`
			`{`
			`"cell_type": "code",`
			`"execution_count": null,`
			`"metadata": {},`
			`"outputs": [],`
			`"source": [`
			`"# remove tweets labeled as irrelevant (copy to avoid modifying a view of the original frame)\n",`
			`"text_data_added = text_data_added[text_data_added[\"target\"] != \"irrelevant\"].copy()\n",`
			`"# convert string labels to numeric targets\n",`
			`"target_map = {\"positive\": 4, \"negative\": 0, \"neutral\": 2}\n",`
			`"text_data_added[\"target\"] = text_data_added[\"target\"].map(target_map)"`
			`]`
			`},`
			`{`
			`"cell_type": "markdown",`
			`"metadata": {},`
			`"source": [`
			`"#### Combine the two data sets and save as one new file"`
			`]`
			`},`
			`{`
			`"cell_type": "code",`
			`"execution_count": null,`
			`"metadata": {},`
			`"outputs": [],`
			`"source": [`
			`"text_data_new = pd.concat([text_data, text_data_added])\n",`
			`"filename = \"sentiment_full.csv\"\n",`
			`"text_data_new.to_csv(filename, index=False)\n",`
			`"upload_to_s3(bucket, \"text_twitter_sentiment_full\", filename)"`
			`]`
			`},`
			`{`
			`"cell_type": "markdown",`
			`"metadata": {},`
			`"source": [`
			`"### Citation\n",`
			`"Sentiment140 data, Go, A., Bhayani, R. and Huang, L., 2009. Twitter sentiment classification using distant supervision. CS224N Project Report, Stanford, 1(2009), p.12.\n",`
			`"\n",`
			`"SMS Spam data, Almeida, T.A., Gómez Hidalgo, J.M., Yamakami, A. Contributions to the Study of SMS Spam Filtering: New Collection and Results. Proceedings of the 2011 ACM Symposium on Document Engineering (DOCENG'11), Mountain View, CA, USA, 2011.\n",`
			`"\n",`
			`"J! Archive, J! Archive is created by fans, for fans. The Jeopardy! game show and all elements thereof, including but not limited to copyright and trademark thereto, are the property of Jeopardy Productions, Inc. and are protected under law. This website is not affiliated with, sponsored by, or operated by Jeopardy Productions, Inc."`
			`]`
			`},`
			`{`
			`"attachments": {},`
			`"cell_type": "markdown",`
			`"metadata": {},`
			`"source": [`
			`"## Notebook CI Test Results\n",`
			`"\n",`
			`"This notebook was tested in multiple regions. The test results are as follows, except for us-west-2 which is shown at the top of the notebook.\n",`
			`"\n",`
			`"![This us-east-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/us-east-1/ingest_data\|ingest-data-types\|ingest_text_data.ipynb)\n",`
			`"\n",`
			`"![This us-east-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/us-east-2/ingest_data\|ingest-data-types\|ingest_text_data.ipynb)\n",`
			`"\n",`
			`"![This us-west-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/us-west-1/ingest_data\|ingest-data-types\|ingest_text_data.ipynb)\n",`
			`"\n",`
			`"![This ca-central-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ca-central-1/ingest_data\|ingest-data-types\|ingest_text_data.ipynb)\n",`
			`"\n",`
			`"![This sa-east-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/sa-east-1/ingest_data\|ingest-data-types\|ingest_text_data.ipynb)\n",`
			`"\n",`
			`"![This eu-west-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-west-1/ingest_data\|ingest-data-types\|ingest_text_data.ipynb)\n",`
			`"\n",`
			`"![This eu-west-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-west-2/ingest_data\|ingest-data-types\|ingest_text_data.ipynb)\n",`
			`"\n",`
			`"![This eu-west-3 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-west-3/ingest_data\|ingest-data-types\|ingest_text_data.ipynb)\n",`
			`"\n",`
			`"![This eu-central-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-central-1/ingest_data\|ingest-data-types\|ingest_text_data.ipynb)\n",`
			`"\n",`
			`"![This eu-north-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-north-1/ingest_data\|ingest-data-types\|ingest_text_data.ipynb)\n",`
			`"\n",`
			`"![This ap-southeast-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-southeast-1/ingest_data\|ingest-data-types\|ingest_text_data.ipynb)\n",`
			`"\n",`
			`"![This ap-southeast-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-southeast-2/ingest_data\|ingest-data-types\|ingest_text_data.ipynb)\n",`
			`"\n",`
			`"![This ap-northeast-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-northeast-1/ingest_data\|ingest-data-types\|ingest_text_data.ipynb)\n",`
			`"\n",`
			`"![This ap-northeast-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-northeast-2/ingest_data\|ingest-data-types\|ingest_text_data.ipynb)\n",`
			`"\n",`
			`"![This ap-south-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-south-1/ingest_data\|ingest-data-types\|ingest_text_data.ipynb)\n"`
			`]`
			`}`
			`],`
			`"metadata": {`
			`"kernelspec": {`
			`"display_name": "Python 3 (Data Science 3.0)",`
			`"language": "python",`
			`"name": "python3__SAGEMAKER_INTERNAL__arn:aws:sagemaker:us-east-1:081325390199:image/sagemaker-data-science-310-v1"`
			`},`
			`"language_info": {`
			`"codemirror_mode": {`
			`"name": "ipython",`
			`"version": 3`
			`},`
			`"file_extension": ".py",`
			`"mimetype": "text/x-python",`
			`"name": "python",`
			`"nbconvert_exporter": "python",`
			`"pygments_lexer": "ipython3",`
			`"version": "3.10.6"`
			`}`
			`},`
			`"nbformat": 4,`
			`"nbformat_minor": 4`
			`}`
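
The aggregation steps in the notebook above (rename the new columns, add the missing `user` column, drop "irrelevant" rows, map string labels to numbers, then concatenate) can be tried without S3 access. The following is a minimal sketch on made-up in-memory rows, not the notebook's actual data. The helper name `align_added` and the sample values are assumptions for illustration only. It renames columns by name instead of by position, which is a deliberate variation that does not depend on the column order of the CSV file.

```python
import pandas as pd

# Column order of the original (Sentiment140-style) data set, as named in the notebook.
BASE_COLUMNS = ["target", "tw_id", "date", "flag", "user", "text"]

# Hypothetical stand-in rows; these are NOT real tweets from either data set.
base = pd.DataFrame(
    {
        "target": [0, 4],
        "tw_id": [1, 2],
        "date": ["d1", "d2"],
        "flag": ["NO_QUERY", "NO_QUERY"],
        "user": ["u1", "u2"],
        "text": ["bad day", "great day"],
    }
)
added = pd.DataFrame(
    {
        "Topic": ["apple", "apple", "apple"],
        "Sentiment": ["positive", "irrelevant", "neutral"],
        "TweetId": [10, 11, 12],
        "TweetDate": ["d3", "d4", "d5"],
        "TweetText": ["love it", "spam spam", "it is a phone"],
    }
)


def align_added(df):
    """Reshape the added data set so it matches the base schema (illustrative helper)."""
    # Rename by column name, so the result does not depend on column order.
    out = df.rename(
        columns={
            "Topic": "flag",
            "Sentiment": "target",
            "TweetId": "tw_id",
            "TweetDate": "date",
            "TweetText": "text",
        }
    )
    # The added data has no user names; fill the column with missing values.
    out["user"] = None
    # Drop rows that carry no sentiment signal, then map labels to numbers.
    out = out[out["target"] != "irrelevant"].copy()
    out["target"] = out["target"].map({"positive": 4, "neutral": 2, "negative": 0})
    return out[BASE_COLUMNS]


combined = pd.concat([base, align_added(added)], ignore_index=True)
print(combined[["target", "flag", "text"]])
```

Checking `combined["target"].isna().sum()` after the concatenation is a quick way to catch label strings that the mapping does not cover, for example a misspelled filter value.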